ABSTRACT


We present the variational equations for maximizing the probability of
correct classification as a function of a $1 \times n$ feature selection matrix $B$
for the two population problem. For the special case of equal covariance
matrices the optimal $B$ is unique up to scalar multiples and rank one
sufficient. For equal population means, the best $1 \times n$ $B$ is an eigenvector
corresponding either to the largest or smallest eigenvalue of $\Sigma_2^{-1}\Sigma_1$, where
$\Sigma_1$ and $\Sigma_2$ are the $n \times n$ covariance matrices of the two populations. The
transformed probability of correct classification depends only on the
eigenvalue. Finally, a procedure is proposed for constructing an optimal or
nearly optimal $k \times n$ matrix of rank $k$ without solving the $k$-dimensional
variational equation.


Results on the Two Population Feature
Selection Problem Using Probability of
Correct Classification as a Criterion

by 

B.C. Peters, Jr. 


1. Introduction 

Let $\pi_1$ and $\pi_2$ be $n$-variate normally distributed populations with
conditional densities $P_1(x) \sim N(\mu_1, \Sigma_1)$ and $P_2(x) \sim N(\mu_2, \Sigma_2)$ and a priori
probabilities $\alpha_1$ and $\alpha_2$ respectively. In this note we consider some
special cases of the problem of selecting a $1 \times n$ nonzero vector $B$ which
maximizes the transformed probability of correct classification

$$h(B) = \int_{\mathbb{R}} \max[\alpha_1 P_1(y,B),\ \alpha_2 P_2(y,B)]\,dy,$$

where $P_i(y,B) \sim N(B\mu_i, B\Sigma_i B^T)$ are the conditional densities of the variable
$y = Bx$, $i = 1,2$. We assume the maximum likelihood classifier: assign $x$ to
$\pi_1$ if $\alpha_1 P_1(Bx,B) \ge \alpha_2 P_2(Bx,B)$; otherwise, assign $x$ to $\pi_2$.
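
The definition of $h(B)$ can be evaluated numerically for a given row vector. The following is a minimal sketch, not part of the original note: the function names (`dot`, `quad`, `normal_pdf`, `h`), the trapezoid rule, the integration width, and the sample populations are all assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quad(B, S):
    """Quadratic form B S B^T for a row vector B and a square matrix S (list of rows)."""
    return dot(B, [dot(row, B) for row in S])

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def h(B, mu1, S1, a1, mu2, S2, a2, n_steps=20000, width=10.0):
    """Trapezoid-rule estimate of the integral of max[a1*P1(y,B), a2*P2(y,B)]."""
    m1, v1 = dot(B, mu1), quad(B, S1)   # mean and variance of y = Bx under population 1
    m2, v2 = dot(B, mu2), quad(B, S2)   # same under population 2
    spread = math.sqrt(max(v1, v2))
    lo = min(m1, m2) - width * spread   # assumed truncation of the real line
    hi = max(m1, m2) + width * spread
    step = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        y = lo + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * max(a1 * normal_pdf(y, m1, v1), a2 * normal_pdf(y, m2, v2))
    return total * step

# Made-up two-dimensional populations, for illustration only.
mu1, S1 = [0.0, 0.0], [[1.0, 0.2], [0.2, 2.0]]
mu2, S2 = [1.0, 2.0], [[3.0, 0.5], [0.5, 1.0]]
B = [1.0, 0.5]
value = h(B, mu1, S1, 0.4, mu2, S2, 0.6)
rescaled = h([3.0 * b for b in B], mu1, S1, 0.4, mu2, S2, 0.6)
print(value, rescaled)
```

The two printed values should agree, since rescaling $B$ by a nonzero scalar rescales $y$ without changing the decision.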

It is shown in [2] that for the $B$ which maximizes $h(B)$, the
Gateaux differential

$$\delta h(B;C) = \lim_{s \to 0} \frac{h(B + sC) - h(B)}{s}$$

exists for all $1 \times n$ vectors $C$ and

$$\delta h(B;C) = \alpha_1 \int_{R_1(B)} \delta P_1(y,B;C)\,dy + \alpha_2 \int_{R_2(B)} \delta P_2(y,B;C)\,dy, \tag{1}$$

where the $R_i(B)$ are the Bayes regions

$$R_1(B) = \{y \in \mathbb{R} \mid \alpha_1 P_1(y,B) > \alpha_2 P_2(y,B)\},$$
$$R_2(B) = \{y \in \mathbb{R} \mid \alpha_1 P_1(y,B) < \alpha_2 P_2(y,B)\}.$$


Moreover, [1], 




$$\delta P_i(y,B;C) = P_i(y,B)\left\{ \frac{C\Sigma_i B^T}{(B\Sigma_i B^T)^2}(y - B\mu_i)^2 + \frac{C\mu_i}{B\Sigma_i B^T}(y - B\mu_i) - \frac{C\Sigma_i B^T}{B\Sigma_i B^T} \right\}. \tag{2}$$


Substituting (2) into (1) and integrating by parts gives 


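$$\delta h(B;C) = -\alpha_1 P_1(y,B)\left[\frac{C\Sigma_1 B^T}{B\Sigma_1 B^T}(y - B\mu_1) + C\mu_1\right]\Bigg|_{R_1(B)} - \alpha_2 P_2(y,B)\left[\frac{C\Sigma_2 B^T}{B\Sigma_2 B^T}(y - B\mu_2) + C\mu_2\right]\Bigg|_{R_2(B)}. \tag{3}$$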
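
The integration step can be checked by differentiating the bracketed expression in (3). The worked identity below is a sketch added for illustration, not taken from the original; the shorthand $u = y - B\mu_i$ and $\sigma_i^2 = B\Sigma_i B^T$ is introduced here, and the only fact used is $\partial P_i/\partial y = -(u/\sigma_i^2)P_i$.

```latex
\begin{align*}
\frac{d}{dy}\left\{ -P_i(y,B)\left[ \frac{C\Sigma_i B^T}{\sigma_i^2}\,u + C\mu_i \right] \right\}
  &= \frac{u}{\sigma_i^2}\,P_i(y,B)\left[ \frac{C\Sigma_i B^T}{\sigma_i^2}\,u + C\mu_i \right]
     - P_i(y,B)\,\frac{C\Sigma_i B^T}{\sigma_i^2} \\
  &= P_i(y,B)\left\{ \frac{C\Sigma_i B^T}{\sigma_i^4}\,u^2 + \frac{C\mu_i}{\sigma_i^2}\,u
     - \frac{C\Sigma_i B^T}{\sigma_i^2} \right\}
   = \delta P_i(y,B;C).
\end{align*}
```

Each integral in (1) is therefore the difference of this antiderivative across the endpoints of the corresponding Bayes region, which is what the vertical bars in (3) denote.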


In order to determine $R_1(B)$ and $R_2(B)$ it is necessary to solve the
equation $\alpha_1 P_1(y,B) = \alpha_2 P_2(y,B)$, whose roots are those of the discriminant
function

$$H(y,B) = \alpha(B)y^2 + 2\beta(B)y + \gamma(B),$$

where

$$\alpha(B) = B(\Sigma_1 - \Sigma_2)B^T,$$
$$\beta(B) = (B\Sigma_2 B^T)B\mu_1 - (B\Sigma_1 B^T)B\mu_2,$$
$$\gamma(B) = (B\Sigma_1 B^T)(B\mu_2)^2 - (B\Sigma_2 B^T)(B\mu_1)^2 + (B\Sigma_1 B^T)(B\Sigma_2 B^T)\left[\ln\frac{B\Sigma_2 B^T}{B\Sigma_1 B^T} + \ln\frac{\alpha_1^2}{\alpha_2^2}\right].$$

We are not interested in the case where $H(y,B) = 0$ has no real roots or
holds identically, since in this case we always have $h(B) = \max\{\alpha_1, \alpha_2\}$,
which is the minimum value that $h(B)$ can attain.
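
The coefficients of the discriminant function translate directly into code. The sketch below is illustrative only: the helper names, the tolerance, and the sample populations are assumptions, and the roots are checked against the defining equation $\alpha_1 P_1 = \alpha_2 P_2$.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quad(B, S):
    """Quadratic form B S B^T for a row vector B and a square matrix S (list of rows)."""
    return dot(B, [dot(row, B) for row in S])

def discriminant_coeffs(B, mu1, S1, a1, mu2, S2, a2):
    """Return (alpha(B), beta(B), gamma(B)) of H(y,B) = alpha y^2 + 2 beta y + gamma."""
    m1, m2 = dot(B, mu1), dot(B, mu2)
    v1, v2 = quad(B, S1), quad(B, S2)
    alpha = v1 - v2
    beta = v2 * m1 - v1 * m2
    gamma = (v1 * m2 ** 2 - v2 * m1 ** 2
             + v1 * v2 * (math.log(v2 / v1) + math.log(a1 ** 2 / a2 ** 2)))
    return alpha, beta, gamma

def real_roots(alpha, beta, gamma, tol=1e-12):
    """Real roots of alpha y^2 + 2 beta y + gamma = 0, in increasing order."""
    if abs(alpha) < tol:                      # equal projected variances: linear case
        return [] if abs(beta) < tol else [-gamma / (2.0 * beta)]
    disc = beta ** 2 - alpha * gamma
    if disc < 0.0:
        return []
    r = math.sqrt(disc)
    return sorted([(-beta - r) / alpha, (-beta + r) / alpha])

def weighted_density(y, mean, var, prior):
    """prior * N(mean, var) density at y."""
    return prior * math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Made-up populations, for illustration only.
mu1, S1, a1 = [0.0, 0.0], [[1.0, 0.2], [0.2, 2.0]], 0.4
mu2, S2, a2 = [1.0, 2.0], [[3.0, 0.5], [0.5, 1.0]], 0.6
B = [1.0, 0.5]
roots = real_roots(*discriminant_coeffs(B, mu1, S1, a1, mu2, S2, a2))
print(roots)
```

At each returned root the two weighted densities of $y = Bx$ coincide, so the roots are the boundary points of the Bayes regions $R_1(B)$ and $R_2(B)$.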


2. The Equal Covariance Case 


If $\Sigma_1 = \Sigma_2 = \Sigma$, then $\alpha(B) = 0$ and $H(y,B) = 0$ has the single root

$$a = \frac{B(\mu_1 + \mu_2)}{2} - \frac{(B\Sigma B^T)\ln(\alpha_1^2/\alpha_2^2)}{2B(\mu_1 - \mu_2)}.$$

For either $R_1(B) = (-\infty, a)$ or $R_2(B) = (-\infty, a)$, substitution into equation
(3) yields

$$\delta h(B;C) = k\left[C(\mu_1 - \mu_2) - \frac{C\Sigma B^T}{B\Sigma B^T}\,B(\mu_1 - \mu_2)\right],$$

where $k$ is a nonzero scalar.


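Thus, for the optimal $B$,

$$\mu_1 - \mu_2 = \frac{\Sigma B^T}{B\Sigma B^T}\,B(\mu_1 - \mu_2),$$

which may be rewritten as

$$B^T = \frac{B\Sigma B^T}{B(\mu_1 - \mu_2)}\,\Sigma^{-1}(\mu_1 - \mu_2).$$

It is readily verified that

$$B_0 = (\mu_1 - \mu_2)^T\Sigma^{-1}$$

satisfies this equation and that any other solution must be a scalar multiple
of $B_0$. Since $h(\lambda B_0) = h(B_0)$ for $\lambda \ne 0$, $B_0$ maximizes $h(B)$. The
corresponding probability of correct classification is

$$h(B_0) = \operatorname{erf}\left(\tfrac{1}{2}\sqrt{(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)}\right).$$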
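
The optimality of $B_0$ can be illustrated numerically in two dimensions. The sketch below makes two assumptions that are not stated in the text: the priors are equal ($\alpha_1 = \alpha_2$), and the note's erf is read as the standard normal distribution function (here `phi`), under which the closed form above agrees with the usual one-dimensional result $\Phi(|B(\mu_1 - \mu_2)| / (2\sqrt{B\Sigma B^T}))$. All function names and sample values are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quad(B, S):
    """Quadratic form B S B^T for a row vector B and a square matrix S (list of rows)."""
    return dot(B, [dot(row, B) for row in S])

def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv2(S):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def b_zero(mu1, mu2, S):
    """Row vector B0 = (mu1 - mu2)^T S^{-1} for 2x2 S."""
    diff = [p - q for p, q in zip(mu1, mu2)]
    Sinv = inv2(S)
    return [sum(diff[i] * Sinv[i][j] for i in range(2)) for j in range(2)]

def h_equal_cov(B, mu1, mu2, S):
    """h(B) for a common covariance S and equal priors (assumed)."""
    diff = [p - q for p, q in zip(mu1, mu2)]
    return phi(abs(dot(B, diff)) / (2.0 * math.sqrt(quad(B, S))))

# Made-up populations, for illustration only.
mu1, mu2 = [1.0, 0.0], [-1.0, 1.0]
S = [[2.0, 0.5], [0.5, 1.0]]
B0 = b_zero(mu1, mu2, S)
diff = [p - q for p, q in zip(mu1, mu2)]
closed_form = phi(0.5 * math.sqrt(dot(B0, diff)))   # dot(B0, diff) = diff^T S^{-1} diff
sweep = [h_equal_cov([math.cos(t * math.pi / 50), math.sin(t * math.pi / 50)], mu1, mu2, S)
         for t in range(50)]
print(h_equal_cov(B0, mu1, mu2, S), closed_form, max(sweep))
```

No direction in the sweep should exceed the value attained at $B_0$.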

A nonzero $1 \times n$ vector $B$ is called sufficient if $h(B) = \mathrm{PCC}$, where
PCC is the untransformed probability of correct classification

$$\mathrm{PCC} = \int_{\mathbb{R}^n} \max[\alpha_1 P_1(x),\ \alpha_2 P_2(x)]\,dx = \alpha_1\int_{R_1} P_1(x)\,dx + \alpha_2\int_{R_2} P_2(x)\,dx.$$

$R_1$ and $R_2$ are the Bayes regions in $\mathbb{R}^n$:

$$R_1 = \{x \in \mathbb{R}^n \mid \alpha_1 P_1(x) > \alpha_2 P_2(x)\},$$
$$R_2 = \{x \in \mathbb{R}^n \mid \alpha_1 P_1(x) < \alpha_2 P_2(x)\}.$$

It is shown in [3] that $B$ is sufficient if and only if $B^{-1}(R_1(B)) = R_1$
and $B^{-1}(R_2(B)) = R_2$ up to sets of measure zero. By a straightforward
calculation it follows that for $B_0^T = \Sigma^{-1}(\mu_1 - \mu_2)$,

$$B_0^{-1}(R_1(B_0)) = R_1$$

and

$$B_0^{-1}(R_2(B_0)) = R_2.$$

Thus $B_0$ is sufficient and

$$\mathrm{PCC} = \operatorname{erf}\left(\tfrac{1}{2}\sqrt{(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)}\right).$$


3. The Equal Mean Case


If $\mu_1 = \mu_2 = 0$, the equation $H(y,B) = 0$ reduces to

$$0 = B(\Sigma_1 - \Sigma_2)B^T y^2 + (B\Sigma_1 B^T)(B\Sigma_2 B^T)\left[\ln\frac{B\Sigma_2 B^T}{B\Sigma_1 B^T} + \ln\frac{\alpha_1^2}{\alpha_2^2}\right].$$

In order to avoid complications we will assume throughout this section that
$\alpha_1 = \alpha_2 = \tfrac{1}{2}$, although the results also hold for unequal a priori probabilities.
Thus,

$$0 = B(\Sigma_1 - \Sigma_2)B^T y^2 + (B\Sigma_1 B^T)(B\Sigma_2 B^T)\ln\frac{B\Sigma_2 B^T}{B\Sigma_1 B^T}.$$

The roots of this equation are $-a$ and $a$, where

$$a^2 = \frac{(B\Sigma_1 B^T)(B\Sigma_2 B^T)}{B\Sigma_1 B^T - B\Sigma_2 B^T}\,\ln\frac{B\Sigma_1 B^T}{B\Sigma_2 B^T}.$$


For either $R_1(B) = (-a, a)$ or $R_2(B) = (-a, a)$, substitution into equation
(3) gives

$$\delta h(B;C) = k\left[\frac{C\Sigma_1 B^T}{B\Sigma_1 B^T} - \frac{C\Sigma_2 B^T}{B\Sigma_2 B^T}\right],$$

where $k$ is a nonzero scalar.

Thus if $B$ maximizes $h(B)$, then

$$\frac{\Sigma_1 B^T}{B\Sigma_1 B^T} = \frac{\Sigma_2 B^T}{B\Sigma_2 B^T},$$

which is satisfied if and only if $B^T$ is an eigenvector of $\Sigma_2^{-1}\Sigma_1$. The
corresponding eigenvalue is $\lambda = \dfrac{B\Sigma_1 B^T}{B\Sigma_2 B^T}$. Note that $R_1(B) = (-a, a)$ if
$\lambda < 1$ and $R_2(B) = (-a, a)$ if $\lambda > 1$. Assuming $R_1(B) = (-a, a)$, the
transformed probability of correct classification is


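$$\begin{aligned}
h(B) &= \tfrac{1}{2}\int_{-\infty}^{-a} P_2(y,B)\,dy + \tfrac{1}{2}\int_{-a}^{a} P_1(y,B)\,dy + \tfrac{1}{2}\int_{a}^{\infty} P_2(y,B)\,dy \\
&= \tfrac{1}{2} + \operatorname{erf}\left(\frac{a}{(B\Sigma_1 B^T)^{1/2}}\right) - \operatorname{erf}\left(\frac{a}{(B\Sigma_2 B^T)^{1/2}}\right) \\
&= \tfrac{1}{2} + \operatorname{erf}\left(\sqrt{\frac{\ln\lambda}{\lambda - 1}}\right) - \operatorname{erf}\left(\sqrt{\frac{\lambda\ln\lambda}{\lambda - 1}}\right) \\
&= f(\lambda),
\end{aligned}$$

while if $R_2(B) = (-a, a)$, then

$$h(B) = f(1/\lambda) = 1 - f(\lambda).$$

It is easy to show that $f'(\lambda) < 0$ for $\lambda \in (0,1)$. Hence $h(B)$ is maximized
when $\min\{\lambda, 1/\lambda\}$ is as small as possible. The result may be stated as follows.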
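
The function $f$ is easy to tabulate. The sketch below is illustrative and assumes, as the equal-prior derivation suggests, that the note's erf stands for the standard normal distribution function (called `phi` here); the names `f` and `h_equal_means` are hypothetical.

```python
import math

def phi(z):
    """Standard normal distribution function (assumed meaning of the note's erf)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def f(lam):
    """f(lambda) = 1/2 + phi(sqrt(ln(lam)/(lam-1))) - phi(sqrt(lam*ln(lam)/(lam-1)))."""
    if lam <= 0.0:
        raise ValueError("lambda must be positive")
    if abs(lam - 1.0) < 1e-12:
        return 0.5                      # limiting value: the two densities coincide
    ratio = math.log(lam) / (lam - 1.0) # equals a^2 / (B Sigma_1 B^T), always positive
    return 0.5 + phi(math.sqrt(ratio)) - phi(math.sqrt(lam * ratio))

def h_equal_means(lam):
    """h(B) for equal means and equal priors, as a function of the eigenvalue only."""
    return f(min(lam, 1.0 / lam))

for lam in (0.1, 0.3, 0.9, 1.0, 2.0, 10.0):
    print(lam, f(lam), h_equal_means(lam))
```

The table illustrates the two properties used in the text: $f(1/\lambda) = 1 - f(\lambda)$, and $f$ decreases on $(0,1)$, so the eigenvalue farthest from 1 in the multiplicative sense gives the largest $h(B)$.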

Theorem: Let $\pi_1$ and $\pi_2$ be normally distributed populations in $\mathbb{R}^n$ with equal
means and covariance matrices $\Sigma_1$ and $\Sigma_2$ respectively. Let $\lambda_{\min}$ and $\lambda_{\max}$ be
respectively the smallest and largest eigenvalues of $\Sigma_2^{-1}\Sigma_1$. If $\lambda_{\min} < 1/\lambda_{\max}$,
then $h(B)$ is maximized for $B^T$ any eigenvector of $\Sigma_2^{-1}\Sigma_1$ corresponding to
$\lambda_{\min}$. Otherwise $h(B)$ is maximized for $B^T$ any eigenvector corresponding to
$\lambda_{\max}$.
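
For $n = 2$ the selection rule of the theorem can be carried out in closed form. The following sketch is an illustration under assumptions: it handles only $2 \times 2$ symmetric positive definite covariance matrices, and the helper names (`inv2`, `matmul2`, `eig2`, `best_row`) and sample matrices are hypothetical.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def eig2(M):
    """Eigenpairs (value, vector) of a 2x2 matrix with real eigenvalues, smallest first."""
    (a, b), (c, d) = M
    half_trace = (a + d) / 2.0
    det = a * d - b * c
    r = math.sqrt(max(half_trace ** 2 - det, 0.0))
    pairs = []
    for lam in (half_trace - r, half_trace + r):
        if abs(b) > 1e-12:
            vec = [b, lam - a]
        elif abs(c) > 1e-12:
            vec = [lam - d, c]
        else:                               # diagonal matrix
            vec = [1.0, 0.0] if abs(lam - a) <= abs(lam - d) else [0.0, 1.0]
        pairs.append((lam, vec))
    return pairs

def best_row(S1, S2):
    """Eigenvalue and row vector B chosen by the theorem's rule (equal means)."""
    (lam_min, v_min), (lam_max, v_max) = eig2(matmul2(inv2(S2), S1))
    if lam_min < 1.0 / lam_max:
        return lam_min, v_min
    return lam_max, v_max

# Made-up covariance matrices, for illustration only.
S1 = [[2.0, 0.5], [0.5, 1.0]]
S2 = [[1.0, 0.2], [0.2, 3.0]]
lam, B = best_row(S1, S2)
print(lam, B)
```

The returned vector satisfies $\Sigma_1 B^T = \lambda \Sigma_2 B^T$, which is the variational condition of this section written without the inverse.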

4. Feature Reduction to k > 1 Dimensions. 

If $B$ is a rank $k$ $k \times n$ matrix, it is possible to derive an expression
for $\delta h(B;C)$, where $C$ is a $k \times n$ matrix. Unfortunately, the resulting
variational equation involves integrals over the $k$-dimensional regions $R_1(B)$
and $R_2(B)$ which are difficult to evaluate. Thus, it would be desirable
to have a procedure for constructing a $k \times n$ matrix one row at a time which
maximizes or nearly maximizes $h(B)$. If $Q$ is a nonsingular $k \times k$ matrix,
then $h(QB) = h(B)$. Thus, it can be assumed that the rows of $B$ are orthogonal,
or in the two population case, that $B\Sigma_1 B^T$ and $B\Sigma_2 B^T$ are both diagonal
matrices. The following procedures are immediately suggested. Choose a $1 \times n$
nonzero vector $B_1$ to maximize $h(B)$. Having constructed $B_1, \ldots, B_\ell$
($\ell < n$), choose a nonzero $1 \times n$ vector $B_{\ell+1}$ which maximizes $h(B)$ subject
to the constraints

$$B_{\ell+1}B_i^T = 0, \qquad i = 1, \ldots, \ell,$$

or to

$$B_{\ell+1}\Sigma_1 B_i^T = B_{\ell+1}\Sigma_2 B_i^T = 0, \qquad i = 1, \ldots, \ell.$$

Let

$$\mathbf{B}_k = \begin{pmatrix} B_1 \\ \vdots \\ B_k \end{pmatrix}$$

be the feature selection matrix for reduction to $k$ dimensions. Clearly
$h(\mathbf{B}_1) \le h(\mathbf{B}_2) \le \cdots \le h(\mathbf{B}_n) = \mathrm{PCC}$, since $\mathbf{B}_\ell = (I_\ell \mid Z)\,\mathbf{B}_{\ell+1}$,
where $I_\ell$ is the $\ell \times \ell$ identity matrix and $Z$ is an $\ell \times 1$ zero vector. In
order to justify the use of either of these procedures it would be desirable
to have a nonzero lower bound on $h(\mathbf{B}_{\ell+1}) - h(\mathbf{B}_\ell)$ when $\mathbf{B}_\ell$ is not sufficient.

The orthogonality constraint is computationally more attractive since it is
easy to compute the projection onto the constraint space at each step and
incorporate it into a steepest descent procedure. However, the other
constraint leads to nice theoretical results when applied to the two population
problem with equal population means.
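
The projection mentioned for the orthogonality constraint amounts to removing from a search direction its components along the rows already chosen. The sketch below is a minimal illustration, assuming the previously constructed rows are mutually orthogonal (as the constraint itself guarantees); the name `project_out` and the sample vectors are hypothetical, and the gradient of $h$ is taken as given rather than computed.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def project_out(direction, rows):
    """Project `direction` onto the space orthogonal to every vector in `rows`.

    Assumes the vectors in `rows` are nonzero and mutually orthogonal.
    """
    projected = list(direction)
    for row in rows:
        coefficient = dot(projected, row) / dot(row, row)
        projected = [p - coefficient * r for p, r in zip(projected, row)]
    return projected

# Made-up example: two orthogonal rows already constructed, one search direction.
rows = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
gradient = [3.0, 4.0, 5.0]
step = project_out(gradient, rows)
print(step)
```

A step taken along the projected direction keeps $B_{\ell+1}B_i^T = 0$ for every earlier row, so the iterate never leaves the constraint space.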

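Suppose $\mu_1 = \mu_2 = 0$ and $B_1$ is chosen according to the theorem in the
last section. If $B_2$ maximizes $h(B)$ subject to the constraints
$B_2\Sigma_1 B_1^T = B_2\Sigma_2 B_1^T = 0$ and $h$ is differentiable at $B_2$, then there are
scalars $\lambda_1$ and $\lambda_2$ such that

$$\frac{\Sigma_1 B_2^T}{B_2\Sigma_1 B_2^T} - \frac{\Sigma_2 B_2^T}{B_2\Sigma_2 B_2^T} = \lambda_1\Sigma_1 B_1^T + \lambda_2\Sigma_2 B_1^T.$$

Since $B_1^T$ is an eigenvector of $\Sigma_2^{-1}\Sigma_1$ corresponding to an eigenvalue $\delta_1$,

$$\frac{\Sigma_1 B_2^T}{B_2\Sigma_1 B_2^T} - \frac{\Sigma_2 B_2^T}{B_2\Sigma_2 B_2^T} = (\lambda_1\delta_1 + \lambda_2)\,\Sigma_2 B_1^T.$$

The conditions $B_1\Sigma_1 B_2^T = B_1\Sigma_2 B_2^T = 0$ lead to

$$0 = (\lambda_1\delta_1 + \lambda_2)\,B_1\Sigma_2 B_1^T$$

and $\lambda_1\delta_1 + \lambda_2 = 0$. But then $B_2^T$ is also an eigenvector of $\Sigma_2^{-1}\Sigma_1$. It can easily
be shown that at the $(\ell+1)$st step, the $1 \times n$ vector $B_{\ell+1}$ maximizing $h(B)$
subject to the constraints $B_{\ell+1}\Sigma_1 B_i^T = B_{\ell+1}\Sigma_2 B_i^T = 0$, $i = 1, \ldots, \ell$, is an
eigenvector of $\Sigma_2^{-1}\Sigma_1$. Thus the rows of the $k \times n$ feature selection matrix $\mathbf{B}_k$
are the $k$ eigenvectors corresponding to the largest or smallest eigenvalues of $\Sigma_2^{-1}\Sigma_1$.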
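
One reading of the row-by-row procedure in the equal mean case is that each step applies the criterion of the theorem to the eigenvectors not yet used, so that rows are taken in increasing order of $\min\{\lambda, 1/\lambda\}$. The sketch below illustrates that reading only; it is an assumption about how the steps combine, and the function name `select_rows` and the sample eigenvalues are hypothetical.

```python
def select_rows(eigenvalues, k):
    """Indices of k eigenvalues ranked by min(lam, 1/lam), smallest first.

    `eigenvalues` are assumed to be the (positive) eigenvalues of
    Sigma_2^{-1} Sigma_1; the returned indices identify which eigenvectors
    would form the rows of the k x n feature selection matrix.
    """
    if any(lam <= 0.0 for lam in eigenvalues):
        raise ValueError("eigenvalues must be positive")
    ranked = sorted(range(len(eigenvalues)),
                    key=lambda i: min(eigenvalues[i], 1.0 / eigenvalues[i]))
    return ranked[:k]

# Made-up spectrum, for illustration only.
spectrum = [0.2, 0.9, 3.0, 8.0]
chosen = select_rows(spectrum, 2)
print(chosen)
```

With this spectrum the first row comes from the largest eigenvalue and the second from the smallest, which matches the statement that the rows correspond to the largest or smallest eigenvalues.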